System and Method of Creating and Using Compact Linguistic Data

ABSTRACT

A system and method of creating and using compact linguistic data are provided. Frequencies of words appearing in a corpus are calculated. Each unique character in the words is mapped to a character index, and characters in the words are replaced with the character indexes. Sequences of characters are mapped to substitution indexes, and the sequences of characters in the words are replaced with the substitution indexes. The words are grouped by common prefixes, and each prefix is mapped to location information for the group of words which start with the prefix.

CROSS-REFERENCE TO RELATED APPLICATION

This is a continuation of U.S. patent application Ser. No. 10/289,656, filed on Nov. 7, 2002, which claims priority from U.S. Provisional Application Ser. No. 60/393,903, filed on Jul. 3, 2002.

BACKGROUND

1. Field of the Invention

The present invention relates in general to linguistic data, and in particular to storage and use of the linguistic data for text processing and text input.

2. Description of the State of the Art

The growing use of mobile devices and different types of embedded systems challenges the developers and manufacturers of these devices to create products that require minimal memory usage, yet perform well. A key element of these products is the user interface, which typically enables a user to enter text which is processed by the product.

One application of linguistic data is to facilitate text entry by predicting word completions based on the first characters of a word that are entered by a user. Given a set of predictions that are retrieved from the linguistic data, the user may select one of the predictions, and thus not have to enter the remaining characters in the word.

The prediction of user input is especially useful when included in a mobile device, since such devices typically have input devices, including keyboards, that are constrained in size. Input prediction minimizes the number of keystrokes required to enter words on such devices.

Input prediction is also useful when text is entered using a reduced keyboard. A reduced keyboard has fewer keys than characters that can be entered, so keystroke combinations are ambiguous. A system that uses linguistic data for input prediction allows the user to easily resolve such ambiguities. Linguistic data can also be used to disambiguate individual keystrokes that are entered using a reduced keyboard.

Existing solutions for storage of linguistic data used for text input and processing typically rely on hash tables, trees, linguistic databases or plain word lists. The number of words covered by these linguistic data formats is limited to the words which have been stored.

The linguistic data which is used in existing text input prediction systems is typically derived from a body of language, either text or speech, known as a corpus. A corpus has uses such as analysis of language to establish its characteristics, analysis of human behavior in terms of use of language in certain situations, training a system to adapt its behavior to particular linguistic circumstances, verifying empirically a theory concerning language, or providing a test set for a language engineering technique or application to establish how well it works in practice. There are national corpora of hundreds of millions of words and there are also corpora which are constructed for particular purposes. An example of a purpose-specific corpus is one comprised of recordings of car drivers speaking to a simulation of a voice-operated control system that recognizes spoken commands. An example of a national corpus is the English language.

SUMMARY

A system of creating compact linguistic data is provided. The system comprises a corpus and a linguistic data analyzer. The linguistic data analyzer calculates frequencies of words appearing in the corpus. The linguistic data analyzer also maps each unique character in the words to a character index, and replaces each character in the words with the character index to which the character is mapped. The linguistic data analyzer also maps sequences of characters that appear in the words to substitution indexes, and replaces each sequence of characters in each word with the substitution index to which the sequence of characters is mapped. The linguistic data analyzer also arranges the words into groups where each group contains words that start with a common prefix, and maps each prefix to location information for the group of words which start with the prefix. The compact linguistic data includes the unique characters, the character indexes, the substitution indexes, the location information, the groups of words and the frequencies of the words.

A compact linguistic data structure for a plurality of words is also provided. The words are organized into groups, each group containing words that have a common prefix. The compact linguistic data structure comprises an alphabet comprised of each unique character in the words, a character-mapping table for mapping each character in the alphabet to a character index, a substitution table for mapping sequences of characters from the words to substitution indexes, and a plurality of word definition tables for storing the words. Each word definition table stores each of the words included in one of the groups. The compact linguistic data structure further comprises an offset table for locating the word definition tables. For each of the common prefixes, the offset table contains a location of the word definition table which stores words starting with the common prefix. Each of the words in the word definition tables is encoded by replacing each character in the word with the character index to which the character is mapped by the character-mapping table, and by replacing each sequence of characters from the substitution table that appears in the word with the substitution index to which the sequence of characters is mapped by the substitution table. The common prefixes for words in each word definition table are removed.

A method of creating compact linguistic data is also provided. The method begins with a step of creating a word-list comprising a plurality of words occurring most frequently in a corpus. The method continues with a step of sorting the words in the word-list alphabetically. The method continues with a step of creating a character-mapping table for encoding the words in the word-list by replacing characters in the words with associated character indexes contained in the character-mapping table. The method continues with a step of separating the words in the word-list into groups, wherein words in each group have a common prefix. The method continues with a step of creating a substitution table for encoding the words in the groups by replacing character sequences in the words in the groups with substitution indexes that are mapped to the character sequences by the substitution table. The method continues with a step of encoding the words in the groups into byte sequences using the character-mapping table and the substitution table. The method continues with a step of creating word definition tables and storing the encoded words in the word definition tables. The method continues with a step of creating an offset table for locating groups of encoded words. The method ends with a step of storing the character-mapping table, the substitution table, the word definition tables, and the offset table.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system in which linguistic data is used for text input prediction;

FIG. 2 is a block diagram of a system of creating compact linguistic data;

FIG. 3 is a flowchart illustrating a method of filtering source files;

FIG. 4 is a flowchart illustrating a method of word frequency calculation;

FIG. 5 is a flowchart illustrating a method of creating compact linguistic data;

FIG. 6 is a block diagram of a format of compact linguistic data;

FIG. 7 is a block diagram of a complex word definition table;

FIG. 8 is a flowchart illustrating a method of frequency modification; and

FIG. 9 is a flowchart illustrating a method of inflection analysis.

DETAILED DESCRIPTION

A system and method of creating and using compact linguistic data, based on word prefix indexing with statistical character substitution, are provided. The method by which the system stores the linguistic data requires minimal memory usage and provides very fast access to words which begin with a specified prefix and their associated frequencies.

FIG. 1 is a block diagram of a system in which linguistic data is used for text input prediction. The system includes linguistic data 100, a text input logic unit 102, and a user interface 103. The system can be implemented on any computing device requiring text input, but is especially suited for embedded devices with a slow CPU and significant RAM and ROM limitations, such as a mobile communication device.

The user interface 103 includes a text input device 104, which allows a user to enter text into the system. The text input device 104 is any device that enables text entry, such as a QWERTY, AZERTY or Dvorak keyboard, or a reduced keyboard. The user interface 103 also includes a text output device 106, which displays text to a user. The text output device 106 may be a graphical component presented on the screen of a mobile device or computer.

The linguistic data 100 is based on word prefix indexing with statistical character substitution, and is described in more detail below.

The text input logic unit 102 may, for example, be implemented by computer instructions which are executed by a computer processor that is contained in a mobile device.

The text input logic unit 102 receives text that is entered by a user using the text input device 104. The text input logic unit 102 then uses the text output device 106 to present the user with predictions of words that the user has started to enter. The predictions are the most probable complete words that start with prefixes entered as text by the user, and are retrieved by the text input logic unit 102 from the linguistic data 100. The user may then select one of the predictions using the text input device 104.

Where the text input device 104 is a reduced keyboard, the text input logic unit 102 also disambiguates individual keystrokes that are received from the reduced keyboard, presenting the user with the most probable characters based on words in the linguistic data 100.

FIG. 2 is a block diagram of a system of creating compact linguistic data. The linguistic data analyzer 202 creates linguistic data 204, described in detail below, by analyzing the corpus 200 of a natural language, such as English or French. The linguistic data analyzer 202 calculates frequencies of words appearing in the corpus 200, maps each unique character in the words to a character index, replaces characters in the words with the character indexes to which the characters are mapped, maps sequences of characters that appear in the words to substitution indexes, replaces the sequences of characters in the words with the substitution indexes to which the sequences of characters are mapped, arranges the words into groups where each group contains words that start with a common prefix, and maps each prefix to location information for the group of words which start with the prefix.

The analysis of the corpus 200 by the linguistic data analyzer 202 includes the calculation of the absolute frequency of the unique words appearing in the corpus 200. Methods for the calculation of frequency and creation of a word-list are described in FIG. 3 and FIG. 4. Once a word-list has been derived from the corpus 200, the word-list is used to create the linguistic data 204. The linguistic data 204 includes the unique characters, the character indexes, the substitution indexes, the location information, the groups of words and the frequencies of the words. A method for creating the linguistic data 204 is described in FIG. 5. The linguistic data 204 produced by the linguistic data analyzer 202 is illustrated in FIG. 6.

The absolute frequency of a certain group of words found in the corpus 200 may alternatively be modified by separating this group into a different file and assigning a custom weight to this file. This group may consist of words which are domain specific, such as names of places or medical terms, and which, based on user preferences, must be included in the resulting word-list. As a result, the absolute value of the frequencies for this group of words will be modified using the weight assigned to the group, so that this group of words will have frequencies that are different from those they would otherwise have had.

FIG. 3 is a flowchart illustrating a method of filtering source files. The source files contain text which comprises a corpus. The filtering method is the first step in calculating the frequency of words in the corpus.

The method begins with the step 300 of reading the contents of a source file. After the source file is read, the method continues with the step 302 of performing substitution of text from the file according to user preferences, which may be stored in a properties file. The user preferences specify regular expressions which are applied to the text in order to substitute invalid or unwanted characters. For example, a user may not want street names included in the word list, or an Italian user may want to replace “e” followed by a non-letter with “e”, or a user may want to skip the last sentence of a text when it is expected that the last sentence contains only the author's name.

The method then continues with the step 304 of obtaining a filter corresponding to the type indicated by the file extension of the source file. For example, if the file extension is “.xml”, it is assumed that the file contains an eXtensible Markup Language (XML) document, so an XML filter is obtained. Similarly, if the file extension is “.html”, then a HyperText Markup Language (HTML) filter is obtained, and if the file extension is “.txt”, then a text filter is obtained. Other file extensions may also be mapped to additional filters.

The filter obtained at step 304 is then applied at step 306 in order to remove words which are not part of the corpus, but rather are part of format definitions. For example, an XML filter removes mark-up tags from the text read from the file.

The method continues with the step 308 of extracting the words from the data resulting from step 306, and writing the extracted words to a filtered-words file at step 310.

If it is determined at step 312 that there are more source files to filter, then the method continues at step 300. Otherwise, the method ends at step 314. When the method ends, all of the source files which comprise the corpus have been filtered.
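Steps 304 through 308 can be sketched as follows. This is a minimal illustration, not the filters of the described system: the names `FILTERS`, `markup_filter`, `text_filter`, `filter_source` and `extract_words` are hypothetical, and the tag-stripping regular expression is a simplifying assumption that would not handle comments, scripts or entities in real XML or HTML.

```python
import os
import re


def text_filter(raw):
    """Plain text needs no format-specific filtering."""
    return raw


def markup_filter(raw):
    """Assumed stand-in for an XML/HTML filter: drop anything inside <...>."""
    return re.sub(r"<[^>]*>", " ", raw)


# Hypothetical mapping from file extension to filter (step 304).
FILTERS = {".xml": markup_filter, ".html": markup_filter, ".txt": text_filter}


def filter_source(filename, raw):
    """Pick a filter from the file extension and apply it (steps 304-306)."""
    ext = os.path.splitext(filename)[1].lower()
    return FILTERS.get(ext, text_filter)(raw)


def extract_words(text):
    """Extract runs of letters as words (step 308)."""
    return re.findall(r"[^\W\d_]+", text)


words = extract_words(filter_source("sample.xml", "<p>Hello <b>world</b></p>"))
print(words)
```

Unknown extensions fall back to the text filter here; the specification only says that other extensions may be mapped to additional filters.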

FIG. 4 is a flowchart illustrating a method of word frequency calculation. The method utilizes the filtered-words files that were produced by the method illustrated in FIG. 3. The words from the filtered-words file are loaded into a word-tree. The word-tree is an effective structure to store unique words and their frequencies using minimal memory. The tree is organized such that words that occur frequently in the filtered-words files are located in the inner nodes of the tree, and words that occur less frequently are located in the leaves of the tree. Each node of the tree contains a unique word and the word's absolute frequency. Words are added to and deleted from the tree in a fashion that assures that the tree remains balanced.

The method begins with the step 400 of reading a filtered-words file. The method continues with the step 402 of reading a word from the filtered-words file and adding it into the word-tree, if the word is not already in the word-tree. The frequency associated with the word in the tree is incremented.

The method continues at step 404, where it is determined if the number of nodes in the tree exceeds a predefined limit, which may be specified in a properties file. If the size of the word-tree does not exceed the limit, then the method continues at step 408. Otherwise, the method continues at step 406.

At step 406, the word-tree is shrunk so that it no longer exceeds the size limit. The tree is shrunk by deleting the least-frequently used words from the tree, which are located in the leaf nodes. The method then continues at step 408.

Step 408 determines whether there are any filtered words left in the filtered-words file. If there are, then the method continues at step 402. If there are no filtered words left, then the method continues at step 410.

Step 410 determines whether there are any remaining filtered-words files to process. If there are, then the method continues at step 400. Otherwise, the method continues at step 412.

At step 412, a word-list which stores the words that have been added to the word-tree, together with their frequencies, is written to an output file.

The method illustrated in FIG. 4 allows even very large corpora to be processed by a single computer. The resulting word-list contains up to a predefined number of the most frequently occurring words in the corpus, and the absolute frequencies associated with the words.
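The counting loop of steps 402 through 408 can be sketched as follows. Note the substitution: this sketch uses a plain dictionary in place of the balanced word-tree described above, so it illustrates only the counting and shrinking behavior, not the tree layout. The function name `count_words` and its parameters are hypothetical.

```python
def count_words(words, limit):
    """Count word frequencies while keeping at most `limit` distinct words.

    A dictionary stands in for the word-tree (an assumption of this sketch).
    """
    counts = {}
    for word in words:
        # Step 402: add the word if it is new, then increment its frequency.
        counts[word] = counts.get(word, 0) + 1
        # Steps 404-406: shrink by deleting the least frequent words.
        while len(counts) > limit:
            least = min(counts, key=counts.get)
            del counts[least]
    return counts


counts = count_words(["a", "a", "b", "a", "c", "b"], limit=2)
print(counts)
```

As with any bounded counter, a word that is evicted and later reappears starts again from zero, so the resulting frequencies of rare words are approximate.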

FIG. 5 is a flowchart illustrating a method of creating compact linguistic data. The method uses a word-list containing word frequency information to produce compact linguistic data, and includes word prefix indexing and statistical character substitution.

The method begins at step 500, where the word-list is read from an output file that was produced by a method of word frequency calculation such as the method illustrated in FIG. 4. The words in the word-list are then sorted alphabetically.

The method continues with step 501 of normalizing the absolute frequencies in the word-list. Each absolute frequency is replaced by a relative frequency. Absolute frequencies are mapped to relative frequencies by applying a function, which may be specified by a user. Possible functions include a parabolic, Gaussian, hyperbolic or linear distribution.
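As an illustration of step 501, the following sketch applies a linear mapping, one of the possible functions listed above. The target range of 1 to 255 and the name `normalize_linear` are assumptions made for the example; the specification does not fix either.

```python
def normalize_linear(freqs, top=255):
    """Scale absolute frequencies linearly so the most frequent word maps to `top`.

    The range 1..top is an assumed choice; values are clamped to at least 1
    so that rare words do not vanish.
    """
    peak = max(freqs.values())
    return {word: max(1, round(f * top / peak)) for word, f in freqs.items()}


relative = normalize_linear({"the": 1000, "cat": 10, "zyx": 1})
print(relative)
```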

The method continues with the step 502 of creating a character-mapping table. The character-mapping table is used to encode words in a subsequent step of the method. When encoding is performed, the characters in the original words are replaced with the character indexes of those characters in the character-mapping table. Since the size of the alphabet for alphabetical languages is much less than 256, a single byte is enough to store Unicode character data. For example, the Unicode character 0x3600 can be represented as 10 if it is located at index 10 in the character-mapping table. The location of a character in the character-mapping table is not significant, and is based on the order that characters appear in the given word-list.
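A character-mapping table of this kind can be sketched as follows, assigning indexes in order of first appearance as the paragraph above describes. The function names are hypothetical.

```python
def build_char_table(words):
    """Map each unique character to an index, in order of first appearance."""
    table = {}
    for word in words:
        for ch in word:
            if ch not in table:
                table[ch] = len(table)
    return table


def encode_chars(word, table):
    """Replace each character with its one-byte character index."""
    return bytes(table[ch] for ch in word)


table = build_char_table(["cab", "bad"])
print(table, encode_chars("bad", table))
```

Because `bytes` only accepts values below 256, the sketch fails loudly if the alphabet ever outgrows a single byte.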

The method continues with the step 504 of separating the words in the word-list into groups. Words in each group have a common prefix of a given length and are sorted by frequency. Words are initially grouped by prefixes that are two characters long. If there are more than 256 words that start with the same two-character prefix, then additional separation will be performed with longer prefixes. For example, if the word-list contains 520 words with the prefix “co”, then this group will be separated into groups with prefixes “com”, “con”, and so on.
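The grouping rule of step 504 can be sketched recursively. The name `group_by_prefix` is hypothetical, the word-list is assumed to contain unique words, and sorting each group by frequency is omitted for brevity.

```python
def group_by_prefix(words, length=2, max_size=256):
    """Group words by a prefix of `length` characters, splitting any group
    larger than `max_size` on a prefix one character longer."""
    groups = {}
    for word in words:
        groups.setdefault(word[:length], []).append(word)

    result = {}
    for prefix, members in groups.items():
        can_split = any(len(word) > length for word in members)
        if len(members) > max_size and can_split:
            result.update(group_by_prefix(members, length + 1, max_size))
        else:
            result[prefix] = members
    return result


# A tiny max_size makes the splitting visible on a small list.
print(group_by_prefix(["cat", "car", "cab", "dog"], max_size=2))
```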

The method continues with the step 506 of producing a frequency set for each group of words. In order to reduce the amount of space required to store frequency information, only the maximum frequency of words in each group is retained with full precision. The frequency of each other word is retained as a percentage of the maximum frequency of words in its group. This technique causes some loss of accuracy, but this is acceptable for the purpose of text input prediction, and results in a smaller storage requirement for frequency information.
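A frequency set of this form might be computed as in the sketch below. Rounding to whole percentages is an assumption of the example; the specification does not state the precision used.

```python
def frequency_set(freqs):
    """Return the group maximum and each frequency as a percentage of it."""
    peak = max(freqs)
    return peak, [round(100 * f / peak) for f in freqs]


def restore(peak, percentages):
    """Approximate the original frequencies from the stored percentages."""
    return [peak * p // 100 for p in percentages]


peak, percentages = frequency_set([200, 100, 50, 4])
print(peak, percentages, restore(peak, percentages))
```

The round trip recovers the values only approximately, which is the loss of accuracy the paragraph above accepts.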

The method continues with step 508. In order to reduce the amount of data required to store the words in the word-list, the character sequences that occur most frequently in the words are replaced with substitution indexes. The substitution of n-grams, which are sequences of n characters, enables a number of characters to be represented by a single character. This information is stored in a substitution table. The substitution table is indexed, so that each n-gram is mapped to a substitution index. The words can then be compacted by replacing each n-gram with its substitution index in the substitution table each time the n-gram appears in a word.
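The following sketch illustrates n-gram substitution for bi-grams only. Several details are assumptions made for the example rather than statements from the specification: substitution indexes are placed directly above the character indexes so that both share one byte range, the most frequent bi-grams are chosen by a simple count, and encoding scans greedily from left to right. All names are hypothetical.

```python
from collections import Counter


def build_bigram_table(words, alphabet_size, slots):
    """Map the `slots` most frequent bi-grams to indexes above the alphabet."""
    counts = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    return {gram: alphabet_size + n
            for n, (gram, _) in enumerate(counts.most_common(slots))}


def encode(word, char_table, sub_table):
    """Greedy left-to-right encoding: prefer a bi-gram, else a single character."""
    out = []
    i = 0
    while i < len(word):
        gram = word[i:i + 2]
        if gram in sub_table:
            out.append(sub_table[gram])
            i += 2
        else:
            out.append(char_table[word[i]])
            i += 1
    return out


words = ["the", "then", "this"]
char_table = {ch: n for n, ch in enumerate("thenis")}
sub_table = build_bigram_table(words, len(char_table), slots=1)
print(sub_table, encode("then", char_table, sub_table))
```

Here "th" occurs in all three words, so it takes the single substitution slot and "then" shrinks from four indexes to three.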

The method continues with step 510 of encoding the word groups into byte sequences using the character-mapping table and the substitution table, as described above. The prefixes used to collect words into groups are removed from the words themselves. As a result, each word is represented by a byte sequence, which includes all the data required to find the original word, given its prefix.

The method continues with step 511 of creating word definition tables. The word definition tables store the frequency sets calculated at step 506 and the encoded words produced at step 510.

The method continues with step 512 of creating an offset table. The offset table contains byte sequences that represent the groups of words. This table enables the identification of the start of the byte sequence that represents a particular word group. The offset table is used to locate the byte sequences that comprise the encoded words for a particular group that start with a common prefix.

The method concludes with step 514. At this step, the linguistic data resulting from the method has been stored in the tables that have been created. The data tables, including the character-mapping table, the substitution table, the offset table and the word definition tables, are stored in an output file.

Statistical data gathered during the method of creating compact linguistic data may optionally be stored at step 514. The statistical data includes the frequency with which n-grams stored in the substitution table appear in words in the linguistic data, the number of words in the linguistic data, word-list and corpus from which the word-list was generated, and ratios between the numbers of words in the linguistic data, word-list and corpus.

FIG. 6 is a block diagram of a format of compact linguistic data. The primary objective of the data format is to preserve the simplicity of interpretation of the linguistic data, while minimizing memory use and the number of computer instructions required to create and interpret the data. Linguistic data in the format is produced by the linguistic data analyzer 202 (FIG. 2), and is the output of the method illustrated by FIG. 5.

The format allows linguistic data to be stored with or without word frequency information. When the linguistic data includes frequency information, learning capabilities, which are described below, can be implemented, and the data can be used to predict input entered with a reduced keyboard. If frequency information is not included, then words which are less than three characters long are not included, since they will not be useful for predicting user input.

The format defines the structure of a computer file which contains a header 602 followed by a number of tables.

The header 602 contains a signature including a magic number, which is a number identifying the format of the file. The header 602 also contains information which specifies the version and priority of the linguistic data contained in the file. Priority information is used to assign relative importance to the linguistic data when multiple files containing linguistic data are used by a text input logic unit. The header 602 also indicates whether the file includes frequency information.

The header 602 is followed by the index table 604. The index table 604 contains indexes in the file to the remaining tables which are defined below, and also allows for additional tables to be added. A table is located using the index information found at the table's entry in the index table 604.

The index table 604 is followed by the name table 606. The name table 606 contains a name which identifies the word-list.

The name table 606 is followed by the character-mapping table 608. The character-mapping table 608 contains the alphabet being used for this word-list, and maps each character in the alphabet to a character index. The alphabet consists of each unique character used in words in the word-list.

The character-mapping table 608 is followed by the substitution table 610. The substitution table 610 contains a bi-gram substitution table, followed by a table for each group of higher-order n-grams which are defined, such as tri-grams, four-grams, and so on. Each n-gram is mapped to a substitution index by the substitution table 610.

The substitution table 610 is followed by the offset table 612. This table is used to locate a word definition table, described below, based on the common prefix of words in the word definition table to be located. For each combination of two characters in the alphabet, the table contains the offset in the file of a word definition table that contains words that start with that combination of characters. For empty groups, the offset is equal to the next non-empty offset. Each offset also specifies whether the word definition table located at the offset in the file is simple or complex, as described below.

Given a two-character sequence, the offset is located at the index in the offset table defined by the formula: ((position of the first character in the alphabet * number of characters in the alphabet) + position of the second character in the alphabet). For example, if the alphabet is English, then the size of the alphabet is 26, so the index of “ab” in the offset table is ((0*26)+1), which equals 1. Hence, the size of the offset table 612 is based on the length of the alphabet.
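The index formula can be checked with a short sketch. The lowercase English alphabet and the name `offset_index` are assumptions made for the example.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def offset_index(prefix, alphabet=ALPHABET):
    """Index of a two-character prefix in the offset table."""
    first = alphabet.index(prefix[0])
    second = alphabet.index(prefix[1])
    return first * len(alphabet) + second


print(offset_index("ab"), offset_index("ba"), offset_index("zz"))
```

With 26 characters the largest index is 675, so under this formula the offset table holds 26 * 26 = 676 entries.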

An inflection table, not shown, may optionally be included in the linguistic data. The inflection table stores word suffixes which may be used in word definitions. A method of inflection analysis is illustrated in FIG. 9.

The linguistic data also contains word definition tables 614. A word definition table stores words from a single word group and frequencies associated with the words, and can be either simple or complex. A simple table is used to define words which are grouped by two-character prefixes only. A complex table is used to define words which are grouped by prefixes of greater lengths.

Words in the definition tables 614 are encoded using the character-mapping table 608 and the substitution table 610. The characters in the words are replaced with the corresponding character indexes from the character-mapping table 608, and the n-grams that are in the substitution table 610 are replaced in the words with their corresponding substitution indexes in the substitution table 610. Since the offset table 612 uniquely maps each bi-gram prefix in the alphabet to a location in the file that defines words that start with that prefix, the prefixes do not need to be retained, and thus are removed from the word definitions.

Upper case words may optionally be marked with an additional special character. The special character is stored in the character-mapping table 608, extending the alphabet with an additional character not used in the language of the words in the word-list.

A simple word definition table contains the encoded words of a group, and the frequencies associated with the words. The frequencies are normalized by applying a normalization function which converts the frequencies so that their values are within a predetermined range. Only the maximum frequency of words in the group is stored with full precision in the table. All other frequencies are stored as percentages of the maximum frequency. The encoded words are sorted by frequency. However, if learning capabilities are applied, as described below, then the initial sorting is no longer valid, and the encoded words may need to be re-sorted.

As will be appreciated by those skilled in the art, characters are represented in computer systems by sequences of bits. The words in the word definition tables 614 are separated by characters with the most significant bit set. If a character has its most significant bit set, then it is the last character in a word. The character is then treated as if its most significant bit were not set for the purpose of determining the value of the character, so that the most significant bit does not affect the value of the character.
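The word-separation scheme can be sketched as follows, assuming one byte per encoded character and index values below 128 so that the most significant bit is free to act as the terminator. The function names are hypothetical.

```python
def join_words(words):
    """Concatenate encoded words, setting the top bit on each word's last byte."""
    out = bytearray()
    for word in words:
        out.extend(word[:-1])
        out.append(word[-1] | 0x80)
    return bytes(out)


def split_words(data):
    """Recover the words: a byte with its top bit set ends the current word."""
    words, current = [], []
    for byte in data:
        current.append(byte & 0x7F)  # the top bit never contributes to the value
        if byte & 0x80:
            words.append(current)
            current = []
    return words


packed = join_words([[1, 2, 3], [4]])
print(packed.hex(), split_words(packed))
```

No separator bytes are spent between words, at the cost of halving the range of index values that fit in a byte.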

FIG. 7 is a block diagram of a complex word definition table. The complex word definition table is recursive, in that it contains local word definition tables 708, each of which is a simple or complex word definition table as described above.

The local word definition tables 708 define words that are grouped by higher order n-gram prefixes. Each of the local word definition tables 708 stores words stored by the word definition table that have a common prefix, where the common prefix for words in each of the local word definition tables 708 is longer than the common prefix for words in the word definition table. The common prefixes of words in the local word definition tables 708 are removed.

For example, if a word group includes words which start with the prefix “co”, and there are more than 256 words that start with that prefix, then the complex word definition table for “co”-prefixed words contains local word definition tables 708 that define words that start with “com”, “con”, and so on. The table for “com”-prefixed words could be a complex word definition table that further contains local word definition tables 708 for words starting with “comm” and “comp”, while the table for “con”-prefixed words could be a simple word definition table that contains only words starting with “con”.

In addition to containing local word definition tables 708, each word definition table includes a local offset table 706, which is used to locate each of the local word definition tables 708. Each offset also indicates whether the table that is referred to by the offset is a complex or simple word definition table.

Each complex word definition table also includes a local character-mapping table 704. This table is functionally the same as the character-mapping table 608 (FIG. 6), except that it only contains characters that are included in words that are in local word definition tables 708. The local character-mapping table 704 maps each character in the words in the local word definition tables 708 to a local character index. Words in simple local word definition tables are encoded by replacing characters in the words with the local character indexes.

A complex word definition table also contains a hotword table 700 and an exception table 702. Hotwords are the words associated with the highest frequencies in the group contained in the complex word definition table. The hotword table 700 contains indexes of hotwords that are located in local word definition tables 708 that are simple word definition tables. The exception table 702 stores hotwords that are located in local word definition tables 708 that are complex word definition tables. A hotword can be retrieved quickly using the hotword table 700 and the exception table 702, instead of performing a search of the local word definition tables 708 to find the hotword.

The format of linguistic data described above enables determination of word predictions very quickly, using a minimal amount of memory. When a user enters a word prefix using a text input device that maps characters to unique keys or key combinations, such as a QWERTY keyboard, a text input logic unit retrieves the words in the linguistic data that start with the prefix and have the highest frequencies, and presents the predictions to the user. When the user starts to type a word using a reduced keyboard, the word prefix is ambiguous, since each key on a reduced keyboard is mapped to multiple characters. In this case, a text input logic unit retrieves predictions from the linguistic data that start with any of the combinations of characters that correspond to the prefix entered by the user.
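Prediction from an ambiguous reduced-keyboard prefix can be sketched as below. For clarity the sketch searches a plain dictionary of words and frequencies rather than the compact tables, and the key layout, the name `predict` and the tie-breaking by alphabetical order are all assumptions made for the example.

```python
from itertools import product


def predict(keys, keymap, lexicon, limit=3):
    """Return the most frequent words matching any reading of the key sequence.

    keymap maps each key to the characters it may stand for;
    lexicon maps words to frequencies.
    """
    # Every combination of characters that the pressed keys could mean.
    prefixes = {"".join(chars) for chars in product(*(keymap[k] for k in keys))}
    hits = [(word, freq) for word, freq in lexicon.items()
            if any(word.startswith(p) for p in prefixes)]
    hits.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in hits[:limit]]


keymap = {"1": "qw", "2": "er", "3": "ty"}
lexicon = {"we": 50, "wet": 20, "red": 30, "queen": 10}
print(predict("12", keymap, lexicon))
```

With the compact format, each candidate two-character prefix would instead be looked up through the offset table, so only the relevant word definition tables are read.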

The format also allows for easy modification of the words' frequencies, to conform to an individual user's text input habits. The user's habits, confirmed by the input choices he or she makes when presented with word prediction alternatives, are learned by the text input logic unit and stored in tables including those described below.

Learning capabilities include the modification of frequency information for words, and the addition of words to the linguistic data. Both operations are based on similar processes of adding the words and corresponding frequency information into a learning word-list. The learning word-list includes tables for frequency modification and for the addition of new words.

FIG. 8 is a flowchart illustrating a method of frequency modification. The method proceeds on the assumption that the base linguistic data, which is the linguistic data compiled as described above before any learning data is gathered, has correct frequency information in general. Therefore, the method allows for limited modification of the frequency information.

The method starts with the step 802 of adding a user-selected word 800 to the learning word-list. The user-selected word 800 is the word selected by the user from the list of predicted words offered that begin with a word prefix entered by the user. The user selects a prediction using a text input device. The selected word is added to the learning word-list.

The method continues with step 804 of obtaining the word with the maximum frequency among the words in the prediction list that was presented to the user. The words in the prediction list and their corresponding frequencies may have been obtained from the word definition tables in the base linguistic data, or from the learning word-list. If it is determined at step 806 that the word with maximum frequency was obtained from the word definition tables, then the method continues at step 808, and the user-selected word 800 is assigned a frequency equal to the maximum frequency plus one.

If it is determined at step 806 that the word with maximum frequency was not obtained from the word definition tables, but was rather obtained from the learning word-list, then the method continues at step 810, and the user-selected word 800 is assigned a frequency that is equal to the maximum frequency. The method then ends with step 812 of deleting the word with maximum frequency obtained at step 804 from the learning word-list.

The following paragraphs are examples of the method illustrated in FIG. 8. Each example assumes that the user enters a three-character prefix.

Given the three-character prefix of “beg”, and predictions “began”, which has a frequency of 3024, “begin”, which has a frequency of 2950, “beginning”, which has a frequency of 2880, and “begins”, which has a frequency of 2000, where all words are obtained from the word definition tables in the base linguistic data, if the user selects the word “begin”, then the word “begin” is added to the learning word-list with the frequency 3025.

Given the same three-character prefix “beg”, and predictions “begin”, which has a frequency of 3025, “began”, which has a frequency of 3024, “beginning”, which has a frequency of 2880, and “begins”, which has a frequency of 2000, where “begin” is obtained from the learning word-list, if the user selects “began”, then the word “began” is added to the learning word-list with the frequency 3025, and the word “begin” is deleted from the learning word-list.

The following is an example of the method of FIG. 8 where the three-character prefix is entered using a reduced keyboard. The reduced keyboard includes a key for entering “a”, “b” or “c”, a key for entering “n” or “o”, and a key for entering “w”, “x”, or “y”. In this example, it is assumed that the user enters the three-character prefix by pressing the “a/b/c” key, then the “n/o” key, and finally the “w/x/y” key. Given the predictions “any”, which has a frequency of 3024, “boy”, which has a frequency of 2950, “box”, which has a frequency of 2880, “bow”, which has a frequency of 2000, “cow”, which has a frequency of 1890, and “cox”, which has a frequency of 1002, where all of the words are obtained from word definition tables in the base linguistic data, if the user selects “boy”, then the word “boy” is added to the learning word-list with a frequency of 3025.
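The frequency modification of FIG. 8 and the three examples above can be sketched as follows. This is an illustrative sketch only: the learning word-list is modelled as a plain dictionary, and the names `Prediction` and `apply_selection` are hypothetical. The case in which the user selects the word that already has the maximum frequency is not addressed in the description; the sketch assumes the word is simply recorded with its existing frequency.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    word: str
    frequency: int
    from_learning_list: bool  # False: taken from the base word definition tables


def apply_selection(predictions, selected, learning_list):
    """Update learning_list (dict word -> frequency) after the user picks `selected`."""
    top = max(predictions, key=lambda p: p.frequency)  # step 804
    if top.word == selected:
        # Assumption (not in the description): keep the existing frequency.
        learning_list[selected] = top.frequency
    elif not top.from_learning_list:
        # Steps 806 and 808: maximum came from the base data, so exceed it by one.
        learning_list[selected] = top.frequency + 1
    else:
        # Steps 810 and 812: take over the maximum and delete the old top word.
        learning_list[selected] = top.frequency
        learning_list.pop(top.word, None)
    return learning_list


# First example: every prediction comes from the base linguistic data.
base = [
    Prediction("began", 3024, False),
    Prediction("begin", 2950, False),
    Prediction("beginning", 2880, False),
    Prediction("begins", 2000, False),
]
learning = apply_selection(base, "begin", {})  # {"begin": 3025}

# Second example: "begin" now comes from the learning word-list.
mixed = [Prediction("begin", 3025, True)] + [p for p in base if p.word != "begin"]
learning = apply_selection(mixed, "began", learning)  # {"began": 3025}
```

Because a selected word is only ever raised to the current maximum, or to one above the base maximum, the base frequencies are modified only to a limited degree, consistent with the assumption stated for FIG. 8.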

The learning word-list includes an updated frequencies table that contains words with updated frequency and a new words table that contains new words. Both of these tables include words which are encoded as in the base linguistic data, using the same character-mapping 608 (FIG. 6) and substitution tables 610 (FIG. 6) as are used by the base linguistic data. Each learning word-list table also includes indexes for the beginnings of words in the table, frequency information associated with the words in the table, and a sorting index that specifies the alphabetically sorted order of the words. Each table also includes a flag which indicates whether the table contains updated frequencies or new words. The learning word-list tables follow sequentially one after the other, with the updated frequencies table appearing first.

If the learning word-list tables reach a maximum-defined length, then the oldest words from the tables are deleted in order to make room for new entries in the tables.

Adding words to and deleting words from a learning word-list table are performed by creating the byte sequence representing the updated table and simultaneously writing the byte sequence into an output stream. After the update is complete, the updated data is reread. The process of writing into an output stream occurs every time words are added to or deleted from the learning word-list.

In order to add words to or delete words from one of the learning word-list tables, the alphabet in the character-mapping table 608 (FIG. 6) is updated if it does not contain the characters that appear in words to be added. Words to be added are then encoded using the character-mapping table 608 (FIG. 6) and the substitution table 610 (FIG. 6), and inserted into the beginning of the new words table. Finally, the frequencies and sorting index of the learning word-list table are updated.

FIG. 9 is a flowchart illustrating a method of inflection analysis. The system and method of creating compact linguistic data may alternatively include the method of inflection analysis, in which both a list of words that have frequencies higher than a minimum specified frequency and an inflection table are created. The inflection table is created based on statistical suffix analysis, and encapsulates the linguistic rules for word creation in the language of the corpus. The inflection tables make it possible to produce more than one word using the basic word forms stored in the inflection table, ensuring that more words are covered by the linguistic data, while the basic word-list remains compact. An inflection table may optionally be included in the linguistic data format shown in FIG. 6.

The method begins with the step 900 of finding a configured number of words that occur most frequently in the word-list, based on the absolute frequency of the words.

The method continues with the step 901 of finding suffixes of the frequently occurring words. The step of suffix finding is based on an iterative search of suffixes of decreasing length, starting with suffixes that are six characters long and ending with suffixes that are two characters long. These suffixes do not always match the existing counterparts in the grammar of the given language, but rather the suffix finding is based on the number of occurrences of suffixes in the word-list.

The method continues with the step 902 of updating the inflection table with the suffixes found in the previous step. The first time step 902 is performed, the inflection table is created before it is updated.

At step 903, if the size of the linguistic data is smaller than a configured maximum size, then the method continues at step 901. Otherwise, the method concludes with the step 904 of creating a list of the words in the word-list without the suffixes contained in the inflection table.
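A simplified sketch of the suffix finding of step 901, the table update of step 902, and the suffix removal of step 904 is given below. It is not a complete implementation of FIG. 9: the size check of step 903 is omitted, and a hypothetical `min_occurrences` threshold stands in for the selection criterion, which the description does not specify beyond the number of occurrences. The function names are also assumptions.

```python
from collections import Counter


def find_suffixes(words, min_occurrences=2):
    """Collect suffixes of length 6 down to 2 that occur often enough."""
    table = []
    for length in range(6, 1, -1):  # decreasing length: 6, 5, 4, 3, 2
        # Count suffixes only for words longer than the suffix itself.
        counts = Counter(w[-length:] for w in words if len(w) > length)
        table.extend(s for s, c in sorted(counts.items()) if c >= min_occurrences)
    return table


def strip_suffixes(words, table):
    """Return (stem, suffix index) pairs; the index is None if no suffix matches."""
    result = []
    for w in words:
        for i, suffix in enumerate(table):
            if len(w) > len(suffix) and w.endswith(suffix):
                result.append((w[: -len(suffix)], i))
                break
        else:
            result.append((w, None))
    return result


frequent = ["walking", "talking", "walked", "talked", "cat"]
inflection_table = find_suffixes(frequent)
stems = strip_suffixes(["walking", "cat"], inflection_table)
```

In this example the longest shared suffix of “walking” and “talking” is “alking”, which is not a grammatical suffix of English; this illustrates that the suffixes found are statistical rather than grammatical.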

The inflection table and the list of words without suffixes can then be encoded as described above in reference to FIG. 5. When the method of inflection analysis is used, the resulting compact linguistic data as illustrated in FIG. 6 also includes the inflection table. The words in the word definition tables 614 (FIG. 6) then do not include the suffixes that are included in the inflection table, but rather contain references to the suffixes in the inflection table. The space saved by using the inflection table for each suffix stored is the number of occurrences of the suffix, multiplied by the length of the suffix.
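The space saving stated above can be illustrated with a short calculation. The sketch below counts space in characters and, like the statement above, does not account for the size of the references that replace the suffixes; the function name `space_saved` is hypothetical.

```python
def space_saved(words, suffix):
    """Characters saved by storing `suffix` once in the inflection table."""
    occurrences = sum(1 for w in words if w.endswith(suffix))
    return occurrences * len(suffix)


# Three words end in "ing", so 3 occurrences x 3 characters are saved.
saved = space_saved(["walking", "talking", "sing", "cat"], "ing")
```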

The above description relates to one example of the present invention. Many variations will be apparent to those knowledgeable in the field, and such variations are within the scope of the application.

For example, while the language used in most of the examples is English, the system and method provided create compact linguistic data for any alphabetical language.

In addition, the system and method of creating and using compact linguistic data can be implemented as software, firmware, or hardware, or as a combination thereof, on personal computers, PDAs, cellular telephones, two-way pagers, wearable computers of any sort, printers, set-top boxes and any other devices allowing text input and display.

Also, the methods illustrated in FIGS. 3, 4, 5, 8 and 9 may contain fewer, more or different steps than those that are shown. For example, although the methods describe using computer files to store final and intermediate results of the methods, the results could also be stored in computer memory such as RAM or Flash memory modules.

1. A system of creating compact linguistic data, comprising: a corpus; and a linguistic data analyzer, wherein the linguistic data analyzer calculates frequencies of words appearing in the corpus, maps each unique character in the words to a character index, replaces each character in the words with the character index to which the character is mapped, maps sequences of characters that appear in the words to substitution indexes, replaces each sequence of characters in each word with the substitution index to which the sequence of characters are mapped, arranges the words into groups where each group contains words that start with a common prefix, and maps each prefix to location information for the group of words which start with the prefix, and wherein the compact linguistic data includes the unique characters, the character indexes, the substitution indexes, the location information, the groups of words, and the frequencies of the words.
2.-13. (canceled)
14. A system of creating compact linguistic data, comprising: a corpus; and a linguistic data analyzer, wherein the linguistic data analyzer calculates frequencies of words appearing as independent words in the corpus, maps each unique character in the words to a character index, replaces each character in the words with the character index to which the character is mapped, maps sequences of characters that appear in the words to substitution indexes, replaces each sequence of characters in each word with the substitution index to which the sequence of characters are mapped, arranges the words into groups where each group contains words that start with a common prefix, maps each prefix to location information for the group of words which start with the prefix to create a prefix index, and removes the prefix from the words in the groups of words; and wherein the compact linguistic data includes the unique characters, the character indexes, the substitution indexes, the prefix index, the groups of words, and the frequencies of the words.
15. The system of claim 14, further comprising: a user interface, comprising: a text input device; and a text output device; and a text input logic unit, wherein the text input logic unit receives a text prefix from the text input device, retrieves a plurality of predicted words from the compact linguistic data that start with the text prefix, selects one of the plurality of predicted words for display based on the frequencies of the plurality of predicted words, and displays the one predicted words using the text output device.
16. The system of claim 15, wherein the text input device is a keyboard.
17. The system of claim 16, wherein the keyboard is a reduced keyboard.
18. The system of claim 15, wherein the user interface and the text input logic unit are implemented on a mobile communication device.
19. The system of claim 15, wherein the text input logic unit selects one of the groups of words as the plurality of predicted words.
20. The system of claim 19, wherein the selected group of words is selected based on the frequency of the group of words.
21. The system of claim 15, wherein the text input logic unit is configured to update the frequencies of the words based on whether the predicted word is input by a device user.
22. The system of claim 14, wherein for each group, only the maximum frequency, which is the highest frequency value in the group, is retained with full precision, and the frequencies of words with less than the maximum frequency are retained as a percentage of the maximum frequency.
23. A computer-implemented method of creating compact linguistic data, comprising: performing, by a processor, the operations of: calculating frequencies of words appearing as independent words in the corpus, mapping each unique character in the words to a character index, replacing each character in the words with the character index to which the character is mapped, mapping sequences of characters that appear in the words to substitution indexes, replacing each sequence of characters in each word with the substitution index to which the sequence of characters are mapped, arranging the words into groups where each group contains words that start with a common prefix, mapping each prefix to location information for the group of words which start with the prefix to create a prefix index, and removing the prefix from the words in the groups of words; and storing, in electronic format, the unique characters, the character indexes, the substitution indexes, the prefix index, the groups of words, the frequencies of the words and the frequencies of the groups of words as compact linguistic data.
24. The method of claim 23, further comprising: receiving a text prefix from a text input logic unit of a text input device, retrieving a plurality of predicted words from the compact linguistic data that start with the text prefix, selecting one of the plurality of predicted words for display based on the frequencies of the plurality of predicted words, and displaying the one predicted words using a text output device.
25. The method of claim 24, wherein the text input device is a keyboard.
26. The method of claim 25, wherein the keyboard is a reduced keyboard.
27. The method of claim 24, wherein the text input logic unit is implemented on a mobile communication device.
28. The method of claim 24, wherein the text input logic unit selects one of the groups of words as the plurality of predicted words.
29. The method of claim 28, wherein the selected group of words is selected based on the frequency of the group of words.
30. The method of claim 24, wherein the text input logic unit is configured to update the frequencies of the words based on whether the predicted word is input by a device user.
31. The method of claim 23, wherein for each group, only the maximum frequency, which is the highest frequency value in the group, is retained with full precision, and the frequencies of words with less than the maximum frequency are retained as a percentage of the maximum frequency.